Tertiary Storage Organization for Large Multidimensional Datasets
Abstract
Large multidimensional datasets are found in diverse application areas, such as data warehousing [6], satellite data processing, and high-energy physics [9]. According to current estimates, these datasets are expected to hold terabytes of data. Because they hold mainly historical and aggregate data, they keep growing: raw data accumulates daily, and jobs generate further aggregate data from the raw data. Hence, estimates for the dataset sizes run into several petabytes. Though cost per byte as well as area per byte for secondary storage have been dropping, it is still not cost-effective to store petabyte-sized datasets on secondary storage [4].

Efficient storage organization for multidimensional data has been investigated extensively [8, 1, 5]. Chen et al. [1] discuss the organization of multidimensional data on a hierarchical storage system. The authors prove that the problem of efficiently organizing multidimensional data on a one-dimensional storage system, such as tertiary storage, is NP-complete when arbitrary range queries are allowed. They present a five-step strategy based on heuristics for the problem. Jagadish et al. [5] investigated the problem of efficiently organizing a data warehouse on secondary storage. The workload consists of a restricted set of range queries using hierarchies defined on the dimensions. They cast the problem as finding an optimal path through a lattice and propose a dynamic-programming-based algorithm that determines how the various dimensions are laid out.

We are not aware of any work that takes into consideration practical constraints such as the order in which the data already exists or will be generated. Given an order in which data currently exists (or will be generated) and a limited amount of temporary storage space, we investigate issues in efficiently organizing multidimensional datasets on tertiary storage. We cast the problem as a permutation of the input data stream using limited storage space.

The rest of this document is organized as follows: The problem is formulated in Section 2. Section 3 describes our approach. In Section 4, we present performance results. Section 5 presents conclusions.
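To make the idea of permuting an input stream with limited storage concrete, the following is a minimal illustrative sketch, not the method developed in this paper. It assumes the input can be re-read from the start once per pass and that the target position of every input item is known in advance. The names `permute_stream`, `make_input`, `target_pos`, and `row_to_col_major` are hypothetical, introduced only for this example. Each pass fills a buffer of at most `buffer_size` items with the next contiguous block of the target order and then emits it, so the reordering takes ceil(n / buffer_size) passes over the input.

```python
from math import ceil
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Cell = Tuple[int, int]


def permute_stream(
    make_input: Callable[[], Iterable[Cell]],
    target_pos: Callable[[int], int],
    n: int,
    buffer_size: int,
) -> Iterator[Cell]:
    """Yield the n input items in target order using a bounded buffer.

    make_input() must return a fresh pass over the same input stream.
    target_pos(i) maps the i-th input item to its output position.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    passes = ceil(n / buffer_size)
    for p in range(passes):
        lo = p * buffer_size            # first output position of this pass
        hi = min(lo + buffer_size, n)   # one past the last output position
        buf: List[Optional[Cell]] = [None] * (hi - lo)
        for i, item in enumerate(make_input()):
            t = target_pos(i)
            if lo <= t < hi:            # item belongs to the current block
                buf[t - lo] = item
        yield from buf                  # write the block out sequentially


# Illustrative data: a 2 x 3 array whose cells arrive in row-major order
# and should be written out in column-major order.
rows, cols = 2, 3
cells: List[Cell] = [(r, c) for r in range(rows) for c in range(cols)]


def row_to_col_major(i: int) -> int:
    """Map a row-major input index to its column-major output position."""
    r, c = divmod(i, cols)
    return c * rows + r


reordered = list(permute_stream(lambda: iter(cells), row_to_col_major, len(cells), 4))
```

With a buffer of 4 items this example needs two passes over the six cells. The sketch trades extra reads of the input for buffer space; a scheme based on sorted runs and merging would trade differently, and which trade-off suits tertiary storage depends on device characteristics not covered here.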